POLYMORPHISM DETECTION 

This application is a continuation-in-part of 
08/563,762, filed November 29, 1995, and claims the 
benefit of U.S. provisional application 60/017,260, filed May 
10, 1996, the disclosures of which are incorporated by 
reference in their entirety for all purposes. 

BACKGROUND OF THE INVENTION 
The relationship between structure and function of 
macromolecules is of fundamental importance in the 
understanding of biological systems. These relationships are 
important to understanding, for example, the functions of 
enzymes, structural proteins and signaling proteins, ways in 
which cells communicate with each other, as well as mechanisms 
of cellular control and metabolic feedback. 

Genetic information is critical in continuation of 
life processes. Life is substantially informationally based 
and its genetic content controls the growth and reproduction 
of the organism and its complements. The amino acid sequences 
of polypeptides, which are critical features of all living 
systems, are encoded by the genetic material of the cell. 
Further, the properties of these polypeptides, e.g., as 
enzymes, functional proteins, and structural proteins, are 
determined by the sequence of amino acids which make them up. 
As structure and function are integrally related, many 
biological functions may be explained by elucidating the 
underlying structural features which provide those functions, 
and these structures are determined by the underlying genetic 
information in the form of polynucleotide sequences. Further, 
in addition to encoding polypeptides, polynucleotide sequences 
also can be involved in control and regulation of gene 
expression. It therefore follows that the determination of 
the make-up of this genetic information has achieved 
significant scientific importance. 

As a specific example, diagnosis and treatment of a 
variety of disorders may often be accomplished through 
identification and/or manipulation of the genetic material 
which encodes for specific disease associated traits. In 
order to accomplish this, however, one must first identify a 
correlation between a particular gene and a particular trait. 
This is generally accomplished by providing a genetic linkage 
map through which one identifies a set of genetic markers that 
follow a particular trait. These markers can identify the 
location of the gene encoding for that trait within the 
genome, eventually leading to the identification of the gene. 
Once the gene is identified, methods of treating the disorder 
that results from that gene, i.e., as a result of 
overexpression, constitutive expression, mutation, 
underexpression, etc., can be more easily developed. 

One class of genetic markers includes variants in 
the genetic code termed "polymorphisms." In the course of 
evolution, the genome of a species can collect a number of 
variations in individual bases. These single base changes are 
termed single-base polymorphisms. Polymorphisms may also 
exist as stretches of repeating sequences that vary as to the 
length of the repeat from individual to individual. Where 
these variations are recurring, e.g., exist in a significant 
percentage of a population, they can be readily used as 
markers linked to genes involved in mono- and polygenic 
traits. In the human genome, single-base polymorphisms occur 
roughly once per 300 bp. Though many of these variant bases 
appear too infrequently among the allele population for use as 
genetic markers (i.e., ≤1%), useful polymorphisms (e.g., those 
occurring in 20 to 50% of the allele population) can be found 
approximately once per kilobase. Accordingly, in a human 
genome of approximately 3 Gb, one would expect to find 
approximately 3,000,000 of these "useful" polymorphisms. 

The use of polymorphisms as genetic linkage markers 
is thus of critical importance in locating, identifying and 
characterizing the genes which are responsible for specific 
traits. In particular, such mapping techniques allow for the 
identification of genes responsible for a variety of disease 
or disorder-related traits which may be used in the diagnosis 
and/or eventual treatment of those disorders. Given the size 
of the human genome, as well as those of other mammals, it 
would generally be desirable to provide methods of rapidly 
identifying and screening for polymorphic genetic markers. 
The present invention meets these and other needs. 

SUMMARY OF THE INVENTION 
One aspect of the invention is an array of 
oligonucleotide probes for detecting a polymorphism in a 
target nucleic acid sequence using Principal Component 
Analysis, said array comprising at least one detection block 
of probes, said detection block including a first group of 
probes that are complementary to said target nucleic acid 
sequence except that the group of probes includes all possible 
monosubstitutions of positions in said sequence that are 
within n bases of a base in said sequence that is 
complementary to said polymorphism, wherein n is from 0 to 5, 
and a second and third group of probes complementary to 
marker-specific regions upstream and downstream of the target 
nucleic acid sequence, wherein the third group of probes 
differs from the second set of probes at single bases 
corresponding to known mismatch positions. 

A further aspect of the invention is a method of 
identifying whether a target nucleic acid sequence includes a 
polymorphic variant using principal component analysis, 
comprising: 

hybridizing said target nucleic acid sequence 
to said array comprising at least one detection block of 
probes, said detection block including a first group of probes 
that are complementary to said target nucleic acid sequence 
except that the group of probes includes all possible 
monosubstitutions of positions in said sequence that are 
within n bases of a base in said sequence that is 
complementary to said polymorphism, wherein n is from 0 to 5, 
and a second and third group of probes complementary to 
marker-specific regions upstream and downstream of the target 
nucleic acid sequence, wherein the third group of probes 
differs from the second set of probes at single bases 
corresponding to known mismatch positions; and 

determining hybridization intensities of the target 
nucleic acid and the marker-specific regions to identify said 
polymorphic variant. In one embodiment of the invention, the 
step of determining comprises: 

a) calculating the control difference between the 
average of the hybridization intensities of the second group 
of probes, the hybridization intensities comprising control 
perfect matches (PM), minus the average of the hybridization 
intensities, the hybridization intensities comprising control 
single-base mismatches (MM); 

b) calculating the possible perfect match intensity 
and a heteromismatch intensity from the hybridization 
intensities for each position of monosubstitutions of the 
first group of probes; 

c) calculating the difference between the possible 
perfect match intensity and the heteromismatch intensity for 
each position of monosubstitutions of the first group of 
probes; 

d) calculating a normalized difference (ND) by 
dividing the difference of step (c) by the control difference; 

e) using principal component analysis, identifying 
a polymorphism by comparing normalized differences between 
individuals in a population. 
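
By way of non-limiting illustration, steps (a) through (e) may be sketched in Python as follows. The function names are hypothetical and are not taken from the software of the invention. The possible perfect match and heteromismatch intensities of step (b) are assumed to be supplied by the caller. For brevity, step (e) is reduced to scoring each individual on the first principal axis only, found here by power iteration rather than by a full principal component analysis.

```python
from statistics import mean


def normalized_differences(control_pm, control_mm, possible_pm, hetero_mm):
    """Steps (a)-(d): return one normalized difference (ND) per position."""
    # Step (a): control difference = mean control PM - mean control MM.
    control_diff = mean(control_pm) - mean(control_mm)
    if control_diff <= 0:
        raise ValueError("control probes show no PM/MM discrimination")
    # Steps (b)-(c): per-position difference; step (d): normalize it.
    return [(pm - mm) / control_diff for pm, mm in zip(possible_pm, hetero_mm)]


def first_component_scores(nd_rows, iterations=200):
    """Step (e), simplified: score each individual (row of NDs) on the
    first principal axis, estimated by power iteration."""
    n_cols = len(nd_rows[0])
    col_means = [mean(row[j] for row in nd_rows) for j in range(n_cols)]
    centered = [[row[j] - col_means[j] for j in range(n_cols)] for row in nd_rows]
    # Assumption: the all-ones start vector is not exactly orthogonal
    # to the first principal axis of the data.
    axis = [1.0] * n_cols
    for _ in range(iterations):
        proj = [sum(c[j] * axis[j] for j in range(n_cols)) for c in centered]
        new_axis = [sum(p * c[j] for p, c in zip(proj, centered))
                    for j in range(n_cols)]
        norm = sum(v * v for v in new_axis) ** 0.5
        if norm == 0.0:
            break  # no variance along the current direction
        axis = [v / norm for v in new_axis]
    return [sum(c[j] * axis[j] for j in range(n_cols)) for c in centered]


# Made-up intensities, for illustration only.
nd = normalized_differences([900, 1100], [150, 250], [800, 420, 90], [100, 380, 110])
scores = first_component_scores(
    [[0.9, 0.1, 0.0], [0.85, 0.05, 0.02], [0.1, 0.9, 0.0], [0.5, 0.5, 0.01]]
)
```

In this sketch, individuals with similar normalized differences receive scores of the same sign and similar magnitude, so that genotype groups may appear as clusters along the axis. The sign of the axis itself is arbitrary.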

BRIEF DESCRIPTION OF THE DRAWINGS 
Figure 1 shows a schematic illustration of 
light-directed synthesis of oligonucleotide arrays. 

Figure 2A shows a schematic representation of a 
single oligonucleotide array containing 78 separate detection 
blocks. Figure 2B shows a schematic illustration of a 
detection block for a specific polymorphism denoted WI-567. 
Figure 2B also shows the triplet layout of detection blocks 
for the polymorphism employing 20-mer oligonucleotide probes 
having substitutions 7, 10 and 13 bp from the 3' end of the 
probe. The probes present in the shaded portions of each 
detection block are shown adjacent to each detection block. 



Figure 3 illustrates a tiling strategy for a 
polymorphism denoted WI-567, and having the sequence 
5'-TGCTGCCTTGGTTC[A/G]AGCCGTCATCTCTTT-3'. A detection block 
specific for the WI-567 polymorphism is shown with the probe 
sequences tiled therein listed above. Predicted patterns for 
both homozygous forms and the heterozygous form are shown at 
the bottom. 

Figure 4 shows a schematic representation of a 
detection block specific for the polymorphism denoted WI-1959 
having the sequence 5'-ACCAAAAATCAGTC[T/C]GGGTAACTGAGAGTG-3' 
with the polymorphism indicated by the brackets. A 
fluorescent scan of hybridization of the heterozygous and both 
homozygous forms is shown in the center, with the predicted 
hybridization pattern for each being indicated below. 

Figure 5 illustrates an example of a computer system 
used to execute the software of the present invention which 
determines whether polymorphic markers in DNA are 
heterozygote, homozygote with a first polymorphic marker or 
homozygote with a second polymorphic marker. 

Figure 6 shows a system block diagram of computer 
system 1 used to execute the software of the present 
invention. 

Figure 7 shows a probe array including probes with 
base substitutions at base positions within two base positions 
of the polymorphic marker. The position of the polymorphic 
marker is denoted P0 and may have one of two polymorphic 
markers, x and y (where x and y are each one of A, C, G, or T). 

Figure 8 shows a probe array including probes with 
base substitutions at base positions within two base positions 
of the polymorphic marker. 

Figure 9 shows a high level flowchart of analyzing 
intensities to determine whether polymorphic markers in DNA 
are heterozygote, homozygote with a first polymorphic marker 
or homozygote with a second polymorphic marker. 

Figure 10 shows a Principal Components Plot of 
Marker 219 (KRT8ml). 



Figure 11 shows a schematic representation of a 
process for carrying out the polymorphism detection methods of 
the invention. 

Figure 12 shows the algorithms used for identifying 
genotypes, using the methods of the present invention. 

Figure 13 shows the DB scores of one marker plotted 
along with the genotypes determined by standard sequencing. 
Approximately 220 biallelic markers were assayed together for 
each individual for a sixteen member family. 

DETAILED DESCRIPTION OF THE INVENTION 

I. General 

The present invention generally provides rapid and 
efficient methods for screening samples of genomic material 
for polymorphisms, and arrays specifically designed for 
carrying out these analyses. In particular, the present 
invention relates to the identification and screening of 
single base polymorphisms in a sample. In general, the 
methods of the present invention employ arrays of 
oligonucleotide probes that are complementary to target 
nucleic acid sequence segments from an individual (e.g., a 
human or other mammal) which target sequences include specific 
identified polymorphisms, or "polymorphic markers." The 
probes are typically arranged in detection blocks, each block 
being capable of discriminating the three genotypes for a 
given marker, e.g., the heterozygote or either of the two 
homozygotes. The method allows for rapid, automatable 
analysis of genetic linkage to even complex polygenic traits. 

Oligonucleotide arrays typically comprise a 
plurality of different oligonucleotide probes that are coupled 
to a surface of a substrate in different known locations. 
These oligonucleotide arrays, also described as "Genechips™," 
have been generally described in the art, for example, in U.S. 
Patent No. 5,143,854 and PCT patent publication Nos. WO 
90/15070 and 92/10092. These arrays may generally be produced 
using mechanical synthesis methods or light directed synthesis 
methods which incorporate a combination of photolithographic 
methods and solid phase oligonucleotide synthesis methods. 
See Fodor et al., Science, 251:767-777 (1991), Pirrung et al., 
U.S. Patent No. 5,143,854 (see also PCT Application No. WO 
90/15070) and Fodor et al., PCT Publication No. WO 92/10092 
and U.S. Patent No. 5,424,186, each of which is hereby 
incorporated herein by reference. Techniques for the 
synthesis of these arrays using mechanical synthesis methods 
are described in, e.g., U.S. Patent No. 5,384,261, 
incorporated herein by reference in its entirety for all 
purposes. 

The basic strategy for light directed synthesis of 
oligonucleotides on a VLSIPS™ Array is outlined in Figure 1. 
The surface of a substrate or solid support, modified with 
photosensitive protecting groups (X) is illuminated through a 
photolithographic mask, yielding reactive hydroxyl groups in 
the illuminated regions. A selected nucleotide, typically in 
the form of a 3'-O-phosphoramidite-activated deoxynucleoside 
(protected at the 5' hydroxyl with a photosensitive protecting 
group), is then presented to the surface and coupling occurs 
at the sites that were exposed to light. Following capping 
and oxidation, the substrate is rinsed and the surface is 
illuminated through a second mask, to expose additional 
hydroxyl groups for coupling. A second selected nucleotide 
(e.g., 5'-protected, 3'-O-phosphoramidite-activated 
deoxynucleoside) is presented to the surface. The selective 
deprotection and coupling cycles are repeated until the 
desired set of products is obtained. Pease et al., Proc. 
Natl. Acad. Sci. (1994) 91:5022-5026. Since photolithography 
is used, the process can be readily miniaturized to generate 
high density arrays of oligonucleotide probes. Furthermore, 
the sequence of the oligonucleotides at each site is known. 

II. Identification of Polymorphisms 

The methods and arrays of the present invention 
primarily find use in the identification of so-called "useful" 
polymorphisms (i.e., those that are present in approximately 
20% or more of the allele population). The present invention 
also relates to the detection or screening of specific 
variants of previously identified polymorphisms. 



A wide variety of methods can be used to identify 
specific polymorphisms. For example, repeated sequencing of 
genomic material from large numbers of individuals, although 
extremely time consuming, can be used to identify such 
polymorphisms. Alternatively, ligation methods may be used, 
where a probe having an overhang of defined sequence is 
ligated to a target nucleotide sequence derived from a number 
of individuals. Differences in the ability of the probe to 
ligate to the target can reflect polymorphisms within the 
sequence. Similarly, restriction patterns generated from 
treating a target nucleic acid with a prescribed restriction 
enzyme or set of restriction enzymes can be used to identify 
polymorphisms. Specifically, a polymorphism may result in the 
presence of a restriction site in one variant but not in 
another. This yields a difference in restriction patterns for 
the two variants, and thereby identifies a polymorphism. In a 
related method, U.S. Patent Application Serial No. 08/485,606, 
filed June 7, 1995 describes a method of identifying 
polymorphisms using type-IIs endonucleases to capture 
ambiguous base sequences adjacent the restriction sites, and 
characterizing the captured sequences on oligonucleotide 
arrays. The patterns of these captured sequences are compared 
from various individuals, the differences being indicative of 
potential polymorphisms. 

In a preferred aspect, the identification of 
polymorphisms takes into account the assumption that a useful 
polymorphism (i.e., one that occurs in 20 to 50% of the allele 
population) occurs approximately once per 1 kb in a given 
genome. In particular, random sequences of a genome, e.g., 
random 1 kb sequences of the human genome such as expressed 
sequence tags or "ESTs", can be sequenced from a limited 
number of individuals. When a variant base is detected with 
sufficient frequency, it is designated a "useful" 
polymorphism. In practice, the method generally analyzes the 
same 1 kb sequence from a small number of unrelated 
individuals, i.e., from 3 to 5 (6 to 10 alleles). Where a 
variant sequence is identified, it is then compared to a 
separate pool of material from unrelated individuals (i.e., 10 
unrelated individuals). Where the variant sequence identified 
from the first set of individuals is detectable in the pool of 
the second set, it is assumed to exist at a sufficiently high 
frequency, e.g., at least about 20% of the allele population, 
thereby qualifying as a useful marker for genetic linkage 
analysis. 
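
The rationale for sampling only 6 to 10 alleles may be illustrated with a short calculation. The following Python sketch is offered as an illustration only. It assumes that alleles are sampled independently and that sequencing is error-free, neither of which is stated above, and the function name is hypothetical. Under those assumptions, the chance that a variant of a given allele frequency appears at least once among n alleles is 1 - (1 - f)^n.

```python
def prob_variant_observed(allele_frequency: float, n_alleles: int) -> float:
    """Probability that at least one of n independently sampled alleles
    carries a variant present at the given allele frequency."""
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele_frequency must lie between 0 and 1")
    if n_alleles < 0:
        raise ValueError("n_alleles must be non-negative")
    return 1.0 - (1.0 - allele_frequency) ** n_alleles


# A variant at 20% frequency, screened in 3 or 5 diploid individuals.
p_six = prob_variant_observed(0.20, 6)    # about 0.74
p_ten = prob_variant_observed(0.20, 10)   # about 0.89
# A rare variant (1%) is seldom picked up by so small a sample.
p_rare = prob_variant_observed(0.01, 10)  # about 0.10
```

On these assumptions, a small first panel tends to pass common variants and to miss rare ones, which is consistent with its use as a first filter ahead of the pooled comparison.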

III. Screening Polymorphisms 

Screening polymorphisms in samples of genomic 
material according to the methods of the present invention is 
generally carried out using arrays of oligonucleotide probes. 
These arrays may generally be "tiled" for a large number of 
specific polymorphisms. By "tiling" is generally meant the 
synthesis of a defined set of oligonucleotide probes which is 
made up of a sequence complementary to the target sequence of 
interest, as well as preselected variations of that sequence, 
e.g., substitution of one or more given positions with one or 
more members of the basis set of monomers, i.e. nucleotides. 
Tiling strategies are discussed in detail in Published PCT 
Application No. WO 95/11995, incorporated herein by reference 
in its entirety for all purposes. By "target sequence" is 
meant a sequence which has been identified as containing a 
polymorphism, and more particularly, a single-base 
polymorphism, also referred to as a "biallelic base." It will 
be understood that the term "target sequence" is intended to 
encompass the various forms present in a particular sample of 
genomic material, i.e., both alleles in a diploid genome. 

In a particular aspect, arrays are tiled for a 
number of specific, identified polymorphic marker sequences. 
In particular, the array is tiled to include a number of 
detection blocks, each detection block being specific for a 
specific polymorphic marker or set of polymorphic markers. 
For example, a detection block may be tiled to include a 
number of probes which span the sequence segment that includes 
a specific polymorphism. To ensure probes that are 
complementary to each variant, the probes are synthesized in 
pairs differing at the biallelic base. 



In addition to the probes differing at the biallelic 
bases, monosubstituted probes are also generally tiled within 
the detection block. These monosubstituted probes have bases 
at and up to a certain number of bases in either direction 
from the polymorphism, substituted with the remaining 
nucleotides (selected from A, T, G, C or U). Typically, the 
probes in a tiled detection block will include substitutions 
of the sequence positions up to and including those that are 5 
bases away from the base that corresponds to the polymorphism. 
Preferably, bases up to and including those in positions 2 
bases from the polymorphism will be substituted. The 
monosubstituted probes provide internal controls for the tiled 
array, to distinguish actual hybridization from artifactual 
cross-hybridization. An example of this preferred 
substitution pattern is shown in Figure 3. 
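
The probe set of a detection block may be enumerated programmatically. The following Python sketch is one possible way of doing so and is not the layout of any figure. The function names and the dictionary keys are hypothetical, DNA bases only are used, and the example flanks are a 20-base window trimmed from the WI-567 sequence purely for illustration. For each allele, every position within n bases of the polymorphic base is substituted with each of the four bases, and the probe is taken as the reverse complement of the resulting target-sense sequence.

```python
BASES = "ACGT"
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def tile_detection_block(left_flank: str, alleles: str, right_flank: str, n: int = 2):
    """Map (allele, offset, substituted base) to a probe sequence.

    offset runs from -n to +n relative to the polymorphic base, and the
    substituted base is given in target sense.
    """
    if n > min(len(left_flank), len(right_flank)):
        raise ValueError("flanks are shorter than the substitution window")
    center = len(left_flank)
    block = {}
    for allele in alleles:
        target = left_flank + allele + right_flank
        for offset in range(-n, n + 1):
            pos = center + offset
            for base in BASES:
                variant = target[:pos] + base + target[pos + 1:]
                block[(allele, offset, base)] = reverse_complement(variant)
    return block


# Illustrative 20-base window around the [A/G] position of WI-567.
block = tile_detection_block("CCTTGGTTC", "AG", "AGCCGTCATC", n=2)
```

With two alleles, five positions and four bases, the sketch yields 40 entries. The entries at offset 0 are the same for both alleles, since substituting the polymorphic base itself erases the difference between the two allele forms.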

A variety of tiling configurations may also be 
employed to ensure optimal discrimination of perfectly 
hybridizing probes. For example, a detection block may be 
tiled to provide probes having optimal hybridization 
intensities with minimal cross-hybridization. For example, 
where a sequence downstream from a polymorphic base is G-C 
rich, it could potentially give rise to a higher level of 
cross-hybridization or "noise," when analyzed. Accordingly, 
one can tile the detection block to take advantage of more of 
the upstream sequence. Such alternate tiling configurations 
are schematically illustrated in Figure 2B, bottom, where the 
base in the probe that is complementary to the polymorphism is 
placed at different positions in the sequence of the probe 
relative to the 3' end of the probe. For ease of discussion, 
both the base which represents the polymorphism and the 
complementary base in the probe are referred to herein as the 
"polymorphic base" or "polymorphic marker." 

Optimal tiling configurations may be determined for 
any particular polymorphism by comparative analysis. For 
example, triplet or larger detection blocks like those 
illustrated in Figure 2B may be readily employed to select 
such optimal tiling strategies. 



Additionally, arrays will generally be tiled to 
provide for ease of reading and analysis. For example, the 
probes tiled within a detection block will generally be 
arranged so that reading across a detection block the probes 
are tiled in succession, i.e., progressing along the target 
sequence one or more bases at a time (see, e.g., Figure 3, 
middle). 

Once an array is appropriately tiled for a given 
polymorphism or set of polymorphisms, the target nucleic acid 
is hybridized with the array and scanned. Hybridization and 
scanning are generally carried out by methods described in, 
e.g., Published PCT Application Nos. WO 92/10092 and WO 
95/11995, and U.S. Patent No. 5,424,186, previously 
incorporated herein by reference in their entirety for all 
purposes. In brief, a target nucleic acid sequence which 
includes one or more previously identified polymorphic markers 
is amplified by well known amplification techniques, e.g., 
PCR. Typically, this involves the use of primer sequences 
that are complementary to the two strands of the target 
sequence both upstream and downstream from the polymorphism. 
Asymmetric PCR techniques may also be used. Amplified target, 
generally incorporating a label, is then hybridized with the 
array under appropriate conditions. Upon completion of 
hybridization and washing of the array, the array is scanned 
to determine the position on the array to which the target 
sequence hybridizes. The hybridization data obtained from the 
scan is typically in the form of fluorescence intensities as a 
function of location on the array. 

Although primarily described in terms of a single 
detection block, e.g., for detection of a single polymorphism, 
in preferred aspects, the arrays of the invention will include 
multiple detection blocks, and thus be capable of analyzing 
multiple, specific polymorphisms. For example, preferred 
arrays will generally include from about 50 to about 4000 
different detection blocks with particularly preferred arrays 
including from 100 to 3000 different detection blocks. 

In alternate arrangements, it will generally be 
understood that detection blocks may be grouped within a 
single array or in multiple, separate arrays so that varying, 
optimal conditions may be used during the hybridization of the 
target to the array. For example, it may often be desirable 
to provide for the detection of those polymorphisms that fall 
within G-C rich stretches of a genomic sequence, separately 
from those falling in A-T rich segments. This allows for the 
separate optimization of hybridization conditions for each 
situation. 

IV. Calling 

After hybridization and scanning, the hybridization 
data from the scanned array is then analyzed to identify which 
variant or variants of the polymorphic marker are present in 
the sample, or target sequence, as determined from the probes 
to which the target hybridized, e.g., one of the two 
homozygote forms or the heterozygote form. This determination 
is termed "calling" the genotype. Calling the genotype is 
typically a matter of comparing the hybridization data for 
each potential variant, and based upon that comparison, 
identifying the actual variant (for homozygotes) or variants 
(for heterozygotes) that are present. In one aspect, this 
comparison involves taking the ratio of hybridization 
intensities (corrected for average background levels) for the 
expected perfectly hybridizing probes for a first variant 
versus that of the second variant. Where the marker is 
homozygous for the first variant, this ratio will be a large 
number, theoretically approaching an infinite value. Where 
homozygous for the second variant, the ratio will be a very 
low number, i.e., theoretically approaching zero. Where the 
marker is heterozygous, the ratio will be approximately 1. 
These numbers are, as described, theoretical. Typically, the 
first ratio will be well in excess of 1, e.g., 2, 4, 5 or 
greater. Similarly, the second ratio will typically be 
substantially less than 1, e.g., 0.5, 0.2, 0.1 or less. The 
ratio for heterozygotes will typically be approximately equal 
to 1, e.g., from 0.7 to 1.5. These ratios can vary based upon 
the specific sequence surrounding the polymorphism, and can 
also be adjusted based upon a standard hybridization with a 
control sample containing the variants of the polymorphism. 
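
The ratio comparison described above may be sketched in Python as follows. This is an illustration under stated assumptions rather than the software of the invention. The function name is hypothetical, the default thresholds are simply the example figures given above (2, 0.5, and 0.7 to 1.5), and the "no call" result for ratios falling between those ranges is an assumption added here for completeness.

```python
def call_genotype(pm_first, pm_second, background=0.0,
                  hom_first_min=2.0, hom_second_max=0.5, het_range=(0.7, 1.5)):
    """Return (ratio, call) from the perfect-match intensities of two variants."""
    # Correct both intensities for the average background level.
    first = max(pm_first - background, 0.0)
    second = max(pm_second - background, 0.0)
    if second == 0.0:
        # Theoretical limit: ratio approaches infinity; undefined if both are 0.
        ratio = float("inf") if first > 0.0 else float("nan")
    else:
        ratio = first / second
    if ratio >= hom_first_min:
        call = "homozygous, first variant"
    elif ratio <= hom_second_max:
        call = "homozygous, second variant"
    elif het_range[0] <= ratio <= het_range[1]:
        call = "heterozygous"
    else:
        call = "no call"  # assumed handling of the gaps between ranges
    return ratio, call


# Made-up intensities with an average background of 50.
ratio_a, call_a = call_genotype(1000.0, 100.0, background=50.0)
ratio_b, call_b = call_genotype(500.0, 450.0, background=50.0)
ratio_c, call_c = call_genotype(120.0, 900.0, background=50.0)
```

In practice the thresholds would be tuned per marker, for example against a control hybridization, since the ratios vary with the sequence surrounding the polymorphism.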

The quality of a given call for a particular 
genotype may also be checked. For example, the maximum 
perfect match intensity can be divided by a measure of the 
background noise (which may be represented by the standard 
deviation of the mismatched intensities). Where the ratio 
exceeds some preselected cut-off point, the call is determined 
to be good. For example, where the maximum intensity of the 
expected perfect matches exceeds twice the noise level, it 
might be termed a good call. In an additional aspect, the 
present invention provides software for performing the above 
described comparisons. 
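
A quality check of this kind may be sketched in Python as follows. The function name is hypothetical, and the use of the sample standard deviation (rather than the population standard deviation) is an assumption, since the text does not specify which is meant. The default cut-off of 2 follows the example above.

```python
from statistics import stdev


def is_good_call(perfect_match_intensities, mismatch_intensities, cutoff=2.0):
    """Return True when the maximum perfect-match intensity exceeds
    cutoff times the noise, taken as the standard deviation of the
    mismatch intensities (at least two mismatch values are required)."""
    noise = stdev(mismatch_intensities)
    # Comparing against cutoff * noise avoids dividing by a zero noise level.
    return max(perfect_match_intensities) > cutoff * noise


# Made-up intensities, for illustration only.
good = is_good_call([900.0, 850.0], [100.0, 120.0, 80.0, 100.0])
poor = is_good_call([30.0, 25.0], [100.0, 120.0, 80.0, 100.0])
```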

Fig. 5 illustrates an example of a computer system 
used to execute the software of the present invention which 
determines whether polymorphic markers in DNA are 
heterozygote, homozygote with a first variant of a 
polymorphism or homozygote with a second variant of a 
polymorphism. Fig. 5 shows a computer system 1 which includes 
a monitor 3, screen 5, cabinet 7, keyboard 9, and mouse 11. 
Mouse 11 may have one or more buttons such as mouse buttons 
13. Cabinet 7 houses a CD-ROM drive 15 or a hard drive (not 
shown) which may be utilized to store and retrieve software 
programs incorporating the present invention, digital images 
for use with the present invention, and the like. Although a 
CD-ROM 17 is shown as the removable media, other removable 
tangible media including floppy disks, tape, and flash memory 
may be utilized. Cabinet 7 also houses familiar computer 
components (not shown) such as a processor, memory, and the 
like. 

Fig. 6 shows a system block diagram of computer 
system 1 used to execute the software of the present 
invention. As in Fig. 5, computer system 1 includes monitor 3 
and keyboard 9. Computer system 1 further includes subsystems 
such as a central processor 102, system memory 104, I/O 
controller 106, display adapter 108, removable disk 112, fixed 
disk 116, network interface 118, and speaker 120. Other 
computer systems suitable for use with the present invention 
may include additional or fewer subsystems. For example, 
another computer system could include more than one processor 
102 (i.e., a multi-processor system) or a cache memory. 

Arrows such as 122 represent the system bus 
architecture of computer system 1. However, these arrows are 
illustrative of any interconnection scheme serving to link the 
subsystems. For example, a local bus could be utilized to 
connect the central processor to the system memory and display 
adapter. Computer system 1 shown in Fig. 6 is but an example 
of a computer system suitable for use with the present 
invention. Other configurations of subsystems suitable for 
use with the present invention will be readily apparent to one 
of ordinary skill in the art. 

Fig. 7 shows a probe array including probes with 
base substitutions at base positions within two base positions 
of the polymorphic marker. The position of the polymorphic 
marker is denoted P0 and may have one of two variants of 
the polymorphic marker, x and y (where x and y are each one of 
A, C, G, or T). As indicated, at P-2 there are two columns of 
four cells which contain a base substitution two base 
positions to the left, or 3', from the polymorphic marker. 
The column denoted by an "x" contains polymorphic marker x and 
the column denoted by a "y" contains polymorphic marker y. 

Similarly, P-1 contains probes with base 
substitutions one base position to the left, or 3', of the 
polymorphic marker. P0 contains probes with base 
substitutions at the polymorphic marker position. 
Accordingly, the two columns in P0 are identical. P1 and P2 
contain base substitutions one and two base positions to the 
right, or 5', of the polymorphic marker, respectively. 

As a hypothetical example, assume a single base 
polymorphism exists where one allele contains the subsequence 
TC[A]AG whereas another allele contains the subsequence 
TC[G]AG, where the bracketed base indicates the polymorphism 
in each allele. Fig. 8 shows a probe array including probes 
with base substitutions at base positions within two base 
positions of the polymorphic marker. In the first two 
columns, the cells which contain probes with base A 
(complementary to T in the alleles) two positions to the left 
of the polymorphic marker are shaded. They are shaded to 
indicate that it is expected that these cells would exhibit 
the highest hybridization to the labeled sample nucleic acid. 
Similarly, the second two columns have cells shaded which have 
probes with base G (complementary to C in the alleles) one 
position to the left of the polymorphic marker. 

At the polymorphic marker position (corresponding to 
P0 in Fig. 7), there are two columns: one denoted by an "A" 
and one denoted by a "G". Although, as indicated earlier, the 
probes in these two columns are identical, the probes contain 
base substitutions for the polymorphic marker position. An 
"N" indicates the cells that have probes which are expected to 
exhibit a strong hybridization if the allele contains a 
polymorphic marker A. As will become apparent in the 
following paragraphs, "N" stands for numerator because the 
intensity of these cells will be utilized in the numerator of 
an equation. Thus, the labels were chosen to aid the reader's 
understanding of the present invention. 

A "D" indicates the cells that have probes which are 
expected to exhibit a strong hybridization if the allele 
contains a polymorphic marker G. "D" stands for denominator 
because the intensity of these cells will be utilized in the 
denominator of an equation. The "n" and "d" labeled cells 
indicate these cells contain probes with a single base 
mismatch near the polymorphic marker. As before, the labels 
indicate where the intensity of these cells will be utilized 
in a following equation. 

Fig. 9 shows a high level flowchart of analyzing 
intensities to determine whether polymorphic markers in DNA 
are heterozygote, homozygote with a first polymorphic marker 
or homozygote with a second polymorphic marker. At step 202, 
the system receives the fluorescent intensities of the cells 
on the chip. Although in a preferred embodiment, the 
hybridization of the probes to the sample is determined from 
fluorescent intensities, other methods and labels including 
radioactive labels may be utilized with the present invention. 
An example of one embodiment of a software program for 
carrying out this analysis is reprinted in Software Appendix 
A. 

A perfect match (PM) average for a polymorphic 
marker x is determined by averaging the intensity of the cells 
at P0 that have the base substitution equal to x in Fig. 7. 
Thus, for the example in Fig. 8, the perfect match average for 
A would add the intensities of the cells denoted by "N" and 
divide the sum by 2. 

A mismatch (MM) average for a polymorphic marker x 
is determined by averaging the intensity of the cells that 
contain the polymorphic marker x and a single base mismatch in 
Fig. 7. Thus, for the example in Fig. 8, the mismatch average 
for A would be the sum of the intensities of the cells denoted 
by "n" divided by 14. 

A perfect match average and a mismatch average for 
polymorphic marker y are determined in a similar manner 
utilizing the cells denoted by "D" and "d", respectively. 
Therefore, the perfect match averages are an average intensity 
of cells containing probes that are perfectly complementary to 
an allele. The mismatch averages are an average intensity 
of cells containing probes that have a single base mismatch 
near the polymorphic marker in an allele. 
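
The averaging described above can be sketched in Python as 
follows. This is only an illustration: the dictionary layout, 
the function name label_averages and the intensity values are 
hypothetical, and the sketch assumes 14 mismatch cells for 
marker y as well as for marker x. Only the labels "N", "n", 
"D" and "d" and the averaging rule come from the text. 

```python
from statistics import mean


def label_averages(cells):
    """Return the average intensity for each cell label.

    `cells` maps a label ("N", "n", "D" or "d") to the list of
    intensities of the cells carrying that label in Fig. 8.
    """
    return {label: mean(values) for label, values in cells.items()}


# Made-up intensities: 2 perfect-match and 14 mismatch cells per marker.
cells = {
    "N": [900.0, 1100.0],   # perfect match, marker x (A)
    "n": [100.0] * 14,      # single base mismatch, marker x
    "D": [80.0, 120.0],     # perfect match, marker y (G)
    "d": [90.0] * 14,       # single base mismatch, marker y
}
averages = label_averages(cells)
pm_x, mm_x = averages["N"], averages["n"]
pm_y, mm_y = averages["D"], averages["d"]
```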

At step 204, the system calculates a Ratio of the 
perfect match and mismatch averages for x to the perfect match 
and mismatch averages for y. The numerator of the Ratio 
is the perfect match average for x minus the mismatch average 
for x. In a preferred embodiment, if the resulting numerator 
is less than 0, the numerator is set equal to 0. 

The denominator of the Ratio is the perfect match 
average for y minus the mismatch average for y. In a preferred 
embodiment, if the resulting denominator is less than or equal 
to 0, the denominator is set equal to a de minimis value such 
as 0.00001, so that the Ratio remains defined. 
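
A minimal Python sketch of the Ratio calculation at step 204 is 
given below. The function and parameter names are hypothetical. 
The sketch assumes that the small floor value is applied to the 
denominator, since its purpose is to keep the division defined. 

```python
def compute_ratio(pm_x, mm_x, pm_y, mm_y, floor=0.00001):
    """Ratio of (PM - MM) for marker x to (PM - MM) for marker y."""
    # A negative numerator is set to 0.
    numerator = max(pm_x - mm_x, 0.0)
    # A non-positive denominator is replaced by a small floor value
    # (assumption: the floor guards the division).
    denominator = pm_y - mm_y
    if denominator <= 0:
        denominator = floor
    return numerator / denominator


# Illustrative averages only: strong x signal, weak y signal.
example_ratio = compute_ratio(1000.0, 100.0, 100.0, 90.0)
```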

Once the system has calculated the Ratio, the system 
calculates DB at step 206. DB is calculated by the equation 
DB = 10*log10(Ratio). The logarithmic function puts the ratio 
on a linear scale and makes it easier to interpret the results 
of the comparison of intensities. 
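
The DB calculation can be illustrated with the short Python 
sketch below. The function name is hypothetical, and returning 
negative infinity for a Ratio of 0 is an assumption made here 
so that a numerator clamped to 0 does not raise an error; the 
text does not say how that case is handled. 

```python
import math


def decibels(ratio):
    """DB = 10 * log10(Ratio)."""
    if ratio <= 0:
        # Assumption: a Ratio of 0 is reported as negative infinity.
        return float("-inf")
    return 10.0 * math.log10(ratio)


db_equal = decibels(1.0)      # equal x and y signal
db_x_high = decibels(100.0)   # x signal 100 times the y signal
```

A Ratio of 1 gives a DB of 0, and each tenfold change in the 
Ratio shifts DB by 10. 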

At step 208, the system performs a statistical check 
on the data or hybridization intensities. The statistical 
check is performed to determine if the data will likely 
produce good results. In a preferred embodiment, the 
statistical check involves testing whether the maximum of the 
perfect match averages for x or y is at least twice as great 
as the standard deviation of the intensities of all the cells 
containing a single base mismatch (i.e., denoted by an "n" or 
"d" in Fig. 8). If the perfect match average is at least two 
times greater than this standard deviation, the data is likely 
to produce good results and this is communicated to the user. 
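
The statistical check of step 208 might be sketched in Python 
as follows. The names are hypothetical, and the use of the 
population standard deviation (rather than the sample standard 
deviation) is an assumption, since the text does not specify 
which is meant. 

```python
from statistics import pstdev


def passes_check(pm_x, pm_y, mismatch_intensities, factor=2.0):
    """True if the larger PM average is at least `factor` times the
    standard deviation of all single base mismatch cells ("n" and "d").
    """
    return max(pm_x, pm_y) >= factor * pstdev(mismatch_intensities)


# Made-up mismatch intensities with a standard deviation of 10.
mismatches = [90.0, 110.0] * 14
good = passes_check(1000.0, 100.0, mismatches)
poor = passes_check(15.0, 10.0, mismatches)
```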

The system analyzes DB at step 210 to determine if 
DB is approaching -∞, near 0, or approaching +∞. Because the 
averages for x form the numerator of the Ratio, if DB is 
approaching positive infinity, the system determines that 
the sample DNA contains a homozygote with a first polymorphic 
marker corresponding to x at step 212. If DB is near 0, the 
system determines that the sample DNA contains a heterozygote 
corresponding to both polymorphic markers x and y at step 214. 
Although described as approaching ∞, etc., as described 
previously, these numbers will generally vary, but are 
nonetheless indicative of the calls described. If DB is 
approaching negative infinity, the system determines that 
the sample DNA contains a homozygote with a second polymorphic 
marker corresponding to y at step 216. 

A visual inspection of the Ratio equation in step 
204 shows that the numerator should be higher than the 
denominator if the DNA sample only has the polymorphic marker 
corresponding to x. Similarly, the denominator should be 
higher than the numerator if the DNA sample only has a 
polymorphic marker corresponding to y. If the DNA sample has 
both polymorphic markers, indicating a heterozygote, the Ratio 
should be approximately equal to 1 which results in a 0 when 
the logarithm of the Ratio is calculated. 
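
Following the reasoning of this paragraph, a call could be made 
from DB along the lines of the Python sketch below. The 
function name and the threshold of 5 are purely hypothetical; 
the text gives no numerical cutoff and states only that DB near 
0 indicates a heterozygote. 

```python
def call_genotype(db, threshold=5.0):
    """Call a genotype from DB = 10 * log10(Ratio).

    The x averages form the numerator of the Ratio, so a large
    positive DB points to x only and a large negative DB to y only.
    `threshold` is an illustrative value, not one from the text.
    """
    if db >= threshold:
        return "homozygote x"
    if db <= -threshold:
        return "homozygote y"
    return "heterozygote"


calls = [call_genotype(db) for db in (20.0, 0.3, -30.0)]
```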

The equations discussed above illustrate just one 
embodiment of the present invention. These equations have 
correctly identified polymorphic markers when a visual 
inspection would seem to indicate a different result. This 
may be the case because the equations take into account the 
mismatch intensities in order to determine the presence or 
absence of the polymorphic markers. 

Those of skill in the art, upon reading the instant 
disclosure, will appreciate that a variety of modifications to 
the above described methods and arrays may be made without 
departing from the spirit or scope of the invention. For 
example, one may select the strand of the target sequence to 
optimize the ability to call a particular genome. 
Alternatively, one may analyze both strands, in parallel, to 
provide greater amounts of data from which a call can be made. 
Additionally, the analyses, i.e., amplification and scanning, 
may be performed using DNA, RNA, mixed polymers, and the like. 

The present invention is further illustrated by the 
following examples. These examples are merely to illustrate 
aspects of the present invention and are not intended as 
limitations of this invention. 

V. Examples 

Example 1 - Chip Tiling 

A DNA chip is prepared which contains three 
detection blocks for each of 78 identified single base 
polymorphisms, or biallelic markers, in a segment of human DNA 
(the "target" nucleic acid). Each detection block contains 
probes wherein the identified polymorphism occurs at the 
position in the target nucleic acid complementary to the 7th, 
10th and 13th positions from the 3' end of 20-mer 
oligonucleotide probes. A schematic representation of a 
single oligonucleotide array containing all 78 detection 
blocks is shown in Figure 2A. 

The tiling strategy for each block substitutes bases 
at the position of the polymorphism and at positions up to two 
bases from it in either direction. In addition to the 
substituted positions, the oligonucleotides are synthesized in 
pairs differing at the biallelic base. Thus, the layout of the 
detection block (containing 40 different oligonucleotide 
probes) allows for controlled comparison of the sequences 
involved, as well as simple readout without need for 
complicated instrumentation. A schematic illustration of this 
tiling strategy within a single detection block is shown in 
Figure 3, for a specific polymorphic marker denoted WI-567. 
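
The tiling of one detection block can be sketched in Python as 
shown below. This is an illustrative reading of the strategy, 
not the layout software itself: the function names and the 
tuple layout are hypothetical, the flanking sequence is the 
WI-1959 sequence quoted in Example 2, and the sketch assumes 
that each probe is the reverse complement of a 20-base window 
of the target in which one base has been substituted. It 
enumerates 2 alleles x 5 positions x 4 bases = 40 cells. 

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def tile_block(left, alleles, right, offset, probe_len=20, span=2):
    """Enumerate the cells of one detection block.

    `offset` is the position of the polymorphism counted from the
    3' end of the probe (7, 10 or 13 in Example 1). Each cell is
    returned as (allele, delta, substituted_base, probe_sequence),
    where `delta` runs from -span to +span around the polymorphism.
    """
    cells = []
    start = len(left) - (offset - 1)   # window start in the target
    for allele in alleles:
        target = left + allele + right
        if start < 0 or start + probe_len > len(target):
            raise ValueError("flanking sequence too short for this offset")
        window = target[start:start + probe_len]
        for delta in range(-span, span + 1):
            pos = (offset - 1) + delta  # substituted position in window
            for base in "ACGT":
                variant = window[:pos] + base + window[pos + 1:]
                cells.append((allele, delta, base, reverse_complement(variant)))
    return cells


block = tile_block("ACCAAAAATCAGTC", "TC", "GGGTAACTGAGAGTG", offset=7)
```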

Example 2 - Detection of Polymorphisms 

A target nucleic acid is generated from PCR products 
amplified by primers flanking the markers. These amplicons 
can be produced singly or in multiplexed reactions. Target 
can be produced as ss-DNA by asymmetric PCR from one primer 
flanking the polymorphism or as RNA transcribed in vitro from 
promoters linked to the primers. Fluorescent label is 
introduced into target directly as dye-bearing nucleotides, or 
bound after amplification using dye-streptavidin complexes to 
incorporated biotin-containing nucleotides. In DNA produced 
by asymmetric PCR, fluorescent dye is linked directly to the 5' 
end of the primer. 

Hybridization of target to the arrays tiled in 
Example 1, and subsequent washing, are carried out with 
standard solutions of salt (SSPE, TMACl) and nonionic 
detergent (Triton X-100), with or without added organic solvent 
(formamide). Targets and markers generating strong signals 
are washed under stringent hybridization conditions (37-40°C; 
10% formamide; 0.25xSSPE washes) to give highly discriminating 
detection of the genotype. Markers giving lower hybridization 
intensity are washed under less stringent conditions (^30°C; 
3M TMACl, or 6xSSPE; 6x and 1x SSPE washes) to yield highly 
discriminating detection of the genotype. 

Detection of one polymorphic marker is illustrated 
in Figure 3. Specifically, a typical detection block is shown 
for the polymorphism denoted WI-1959, having the sequence 
5'-ACCAAAAATCAGTC[T/C]GGGTAACTGAGAGTG-3' with the polymorphism 
indicated by the brackets (Figure 3, top), for which all three 
genotypes are available (T/C heterozygote, C/C homozygote and 
T/T homozygote). The expected hybridization patterns for the 
homozygote and heterozygote targets are shown in Figure 3, 
bottom. Three chips were tiled with each chip including the 
illustrated detection block. Each block contained probes 
having the substituted bases at the 7th, 10th and 13th 
positions from the 3' end of 20-mer oligonucleotide probes 
(20/7, 20/10 and 20/13, respectively). These alternate 
detection blocks were tiled to provide a variety of sequences 
flanking the polymorphism itself, to ensure at least one 
detection block hybridizing with a sufficiently low background 
intensity for adequate detection. 

Fluorouracil containing RNA was synthesized from a 
T7 promoter on the upstream primer, hybridized to the 
detection array in 6xSSPE + Triton X-100 at 30°C, and washed in 
0.25xSSPE at room temperature. As shown in the scan (Figure 3, 
middle), fluorescent scans of the arrays correctly identified 
the 5 homozygote or 10 heterozygote features. 

Example 3 - Alternate Gene Calling Method 

An alternate method for calling the genotypes of a 
pedigree (or any collection of individuals) from P246 chip 
data is described herein. In particular, each sample from 
each of the individuals studied is amplified and hybridized to 
a P246 chip. The P246 chip employs a poly-tiling scheme and 
contains marker-specific control probes covering regions 
upstream and downstream from the single-base polymorphism. 
The significance of the control probes is that, provided the 
target sample is amplified, these probes will display both 
perfect matches (PM) and single-base mismatches (MM) at known 
mismatch positions regardless of the target genotype. Although 
the genotype calling method is described here in terms of the 
intensity data of a specific offset block (7/20, 10/20, 13/20) 
and a specific strand (T7, T3), the method is easily 
generalized to accommodate data combining multiple offset 
blocks and strands. 

Considering the data for a collection of individuals 
for a given marker and a given offset block and strand, the 
relevant raw data for each individual are (1) two control PM 
intensities, (2) the corresponding (mismatch position = 
offset) control MM intensities, and (3) 40 block intensities, 
i.e., 4 interrogations for each of 2 alleles at each of 5 
positions. The averages of the two control PMs and of the two 
control MMs are computed. The difference of the two averages 
(PM - MM) is labeled the Control Difference. For each of the 
10 sets of 4 intensities (5 positions for each allele), the 
"possible" PM intensity is identified. (Note that in 
individuals where the allele is present, a PM will occur in a 
predetermined probe from the set of 4 interrogation probes; 
that probe is what we call the "possible" PM.) The hetero-MM 
probe (the same nucleotide appears at the mismatch position in 
both strands) is selected from the remaining 3, and the PM-MM 
difference is calculated. Each of the 10 block PM-MM values is 
divided by the Control Difference, giving a Normalized 
Difference (ND). Thus the data for each individual are now 
reduced from 44 values to 10 ND values, 5 for each allele. 
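
The reduction from 44 raw values to 10 ND values can be 
sketched in Python as follows. The data layout (a 2 x 5 x 4 
nested list of block intensities plus index tables naming the 
"possible" PM and hetero-MM probe in each set of 4) and all 
names and numbers are hypothetical. Rejecting a non-positive 
Control Difference is likewise an assumption, since the text 
does not say how such a sample is treated. 

```python
from statistics import mean


def normalized_differences(control_pm, control_mm, block, pm_index, mm_index):
    """Return a 2 x 5 list of Normalized Difference (ND) values.

    block[allele][position] holds the 4 interrogation intensities;
    pm_index and mm_index give, for each allele and position, which
    of the 4 probes is the "possible" PM and which is the hetero-MM.
    """
    control_difference = mean(control_pm) - mean(control_mm)
    if control_difference <= 0:
        # Assumption: such a sample cannot be normalized.
        raise ValueError("control difference must be positive")
    nd = []
    for allele in range(2):
        row = []
        for position in range(5):
            cells = block[allele][position]
            pm = cells[pm_index[allele][position]]
            mm = cells[mm_index[allele][position]]
            row.append((pm - mm) / control_difference)
        nd.append(row)
    return nd


# Made-up data: the first allele is present, the second is absent.
block = [
    [[800.0, 50.0, 60.0, 300.0]] * 5,
    [[100.0, 40.0, 50.0, 90.0]] * 5,
]
pm_index = [[0] * 5, [0] * 5]
mm_index = [[3] * 5, [3] * 5]
nd = normalized_differences([1000.0, 1200.0], [100.0, 100.0],
                            block, pm_index, mm_index)
```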

Two principal components analyses (PCA) are then 
performed (see Figure 10), one for each allele. PCA 
methodology originated with K. Pearson (1901) Philosophical 
Magazine, 2:559-572 as a means of fitting planes to data by 
orthogonal least squares, but was later proposed by Hotelling 
(1933) Journal of Educational Psychology, 24:417-441, 498-520 
and Hotelling (1936) Psychometrika, 1:27-35 for the 
particular purpose of analyzing correlation structure. The 
correlation structure analyzed in our case is that of the 5 ND 
values correlated over individuals. PCA attempts to find 
hierarchical sets of coefficients so that the simple weighted 
average of the ND values using the first set of coefficients 
would account for the largest portion of the variability among 
individuals. The second set would account for the largest 
portion of remaining variability with the constraint that it 
is orthogonal (non-overlapping) to the first, and so on. 
Without being bound to a particular theory, it is believed 
that the major source of variability among individuals on the 
5 ND scores is the difference between those with 
the given allele and those without. Thus, it can be expected 
that the first principal component would capture this 
difference, so that when the weighted average based on the 
first set of coefficients is computed, individuals with the 
allele will generally have high scores and those without will 
have low scores. Moreover, combining the PCA results from the 
two alleles will distinguish between homozygous individuals 
with the first allele (high PC scores on the first, low on the 
second), homozygous individuals with the second allele (high 
PC scores on the second, low on the first) and heterozygous 
individuals, who will be high on both. This is illustrated in 
the enclosed plot of the two principal components for a set of 
16 individuals from the K104 CEPH family. Clearly the 
individuals can be divided into three groups corresponding to 
the indicated genotypes. 
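
The two-allele PCA call described above can be illustrated with 
the pure-Python sketch below. Several choices are assumptions 
made for the illustration and are not taken from the text: the 
first principal component is found by power iteration on the 
covariance matrix (instead of a full eigendecomposition), its 
sign is oriented so that larger ND values give larger scores, 
and a score above 0 (the mean) is treated as "high". The 
function names and the ND values are hypothetical. 

```python
import math


def first_pc_scores(nd_rows, iterations=200):
    """Score each individual on the first principal component.

    nd_rows holds one row of 5 ND values per individual.
    """
    n, k = len(nd_rows), len(nd_rows[0])
    means = [sum(row[j] for row in nd_rows) / n for j in range(k)]
    centered = [[row[j] - means[j] for j in range(k)] for row in nd_rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(k)]
           for a in range(k)]
    w = [1.0] * k                      # starting vector for power iteration
    for _ in range(iterations):
        w_new = [sum(cov[a][b] * w[b] for b in range(k)) for a in range(k)]
        norm = math.sqrt(sum(v * v for v in w_new))
        if norm == 0:
            break
        w = [v / norm for v in w_new]
    if sum(w) < 0:                     # orient: high ND gives high score
        w = [-v for v in w]
    return [sum(w[j] * c[j] for j in range(k)) for c in centered]


def call_from_scores(score_first, score_second, cutoff=0.0):
    """Combine the two per-allele scores into a genotype call."""
    high_first, high_second = score_first > cutoff, score_second > cutoff
    if high_first and high_second:
        return "heterozygote"
    if high_first:
        return "homozygote, first allele"
    if high_second:
        return "homozygote, second allele"
    return "no call"


# Made-up ND values for six individuals (5 positions each).
nd_first = [[0.5] * 5, [0.6] * 5, [0.5] * 5, [0.4] * 5, [0.0] * 5, [0.05] * 5]
nd_second = [[0.0] * 5, [0.02] * 5, [0.5] * 5, [0.45] * 5, [0.55] * 5, [0.5] * 5]
calls = [call_from_scores(a, b)
         for a, b in zip(first_pc_scores(nd_first), first_pc_scores(nd_second))]
```

In practice the groups would be separated by inspecting the 
plot of the two scores, as in the text, rather than by a fixed 
cutoff. 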

Among the 221 biallelic markers detected on the 
polymorphism chip, seventeen markers were selected 
and assayed in a sixteen member CEPH family for genotyping by 
the present methods (see Figure 14) and by ABI sequencing. Of 
the 272 genotypes (17 markers x 16 individuals) called by the 
two methods, there were only three disagreements (~99% 
concordance). All the genotypes called by either method were 
consistent with Mendelian inheritance. 

While the foregoing invention has been described in 
some detail for purposes of clarity and understanding, it will 
be clear to one skilled in the art from a reading of this 
disclosure that various changes in form and detail can be made 
without departing from the true scope of the invention. All 
publications and patent documents cited in this application 
are incorporated by reference in their entirety for all 
purposes to the same extent as if each individual publication 
or patent document were so individually denoted. 

APPENDIX A 
SOFTWARE APPENDIX 



[Program listing for the intensity analysis described above. 
The listing is illegible in this copy except for the fragment 
below, which converts the Ratio to the DB value and prints it; 
"[...]" marks a comparison operator that cannot be read.] 

    rat = ratio 
    if (ratio [...] 0) rat = .00001 
    lograt = log(rat)/log(10) 
    printf("%3.2f\t", 10*lograt) 